Video stream switching

ABSTRACT

A technique is described for providing a client on a packet based network with a stream of encoded video data. The system is able to maximise the bit-rate of the video stream by adapting to fluctuations in network capacity. The technique is characterised in that adaptation of the bit-rate of the transmitted encoded video data is timed to occur upon a scene change in the video sequence. In this way the interruption to the viewer when the perceived quality of the video sequence increases or decreases is minimised as it is ‘hidden’ in the scene change. The technique is described as applied to hierarchically encoded video data but equally may be applied to other encoding techniques which adapt to network conditions.

[0001] The invention is in the field of video streaming over packet networks and in particular concerns the adaptive transmission of data in response to network congestion.

[0002] In recent years, the Internet has experienced a proliferation in the transmission of real-time multimedia, mainly in the form of streamed audio-visual content, either delivered live or from pre-recorded sources. Furthermore, traditional forms of multimedia such as streaming and conferencing are being followed by applications with richer content such as Internet multi-channel TV and complex immersive environments. This increase in traffic will put strains on the network and it is therefore desirable that application programs are designed to respond to congestion if stability of the network is to be maintained. It is desirable that network conditions are monitored and output bit-rates adjusted to the available bandwidth.

[0003] There is a more immediate advantage to the user in having an adaptive output bit-rate, which is that the highest possible transmission bit-rate that the network will allow is used and therefore the user will always receive the best possible image quality. There are a number of known approaches to adaptive quality video streaming, one of which is hierarchical coding. In this technique the original video data is encoded into a number of discrete streams called layers, where the first layer consists of basic data of a relatively poor quality and where successive layers represent more detailed information, so that layers can be added to increase the image quality or layers can be taken away, depending on the available bandwidth.

[0004] As the bit-rate available to a session is subject to significant variations, the number of layers that are transmitted also varies, and quality fluctuations occur in the decoded image as layers are added or dropped. When layers are added or dropped frequently the fluctuations in quality may become disturbing to a viewer.

[0005] In a first embodiment of the present invention there is provided a method of operating a multimedia server, said method comprising:

[0006] providing a stream of video data representing a video sequence to an output of the multimedia server, wherein the output of the media server is connected to a packet based network,

[0007] measuring a property of the video data in order to determine the occurrence of a scene change in the video sequence,

[0008] detecting the available bandwidth on the network,

[0009] varying the bit-rate of the stream of video data,

[0010] wherein the method is characterised in that

[0011] variation in the bit-rate of the video data is controlled to occur in response to variations in the capacity of the network (203) and preferentially with a scene change in the video sequence.

[0012] The term preferentially used herein is meant to indicate that in certain circumstances (discussed in greater detail below) it may not be practical to wait for a change of scene to occur in the video sequence before varying the bit-rate of the video stream, and in such circumstances the bit-rate will be changed at points in the video stream which do not correspond to a change of scene.

[0013] The term scene change (or change of scene) is intended to refer to a sudden change in a video sequence within the space of one or a very few frames such as typically occurs at a change of scene, whether or not there has been an actual change of scene.

[0014] Embodiments of the present invention will now be described, by way of example only, with reference to the following figures, in which:

[0015] FIG. 1 is a schematic diagram of the content-based inter-stream session bandwidth sharing architecture.

[0016] FIG. 2 shows the order of packets in a layered coding system.

[0017] FIG. 3 is a diagram of a media server.

[0018] FIG. 4 is a diagram of the network interface of the media server shown in FIG. 3.

[0019] FIG. 5 is a diagram of a client.

[0020] FIG. 6 is a diagram of a client according to a first embodiment of the present invention.

[0021] The client/server arrangement for a known hierarchical streaming technology is shown in FIG. 1. A media server 202 is provided which has access to compressed audiovisual data 201, which may be fed ‘live’ from an outside broadcast or may be pre-compressed and stored in a database. The data source 201 may be on the same premises as the media server 202 and linked via an intranet. The media server 202 runs on a suitable server computer which has access to the Internet 203.

[0022] A video viewer, hereinafter referred to as the client 204, running on a PC suitably configured to have access to the Internet 203, may connect to the media server 202 via the Internet 203 and thus the client 204 is able to access content. A suitable PC terminal is used.

[0023] Layered video compression is achieved with the 1998 version of H.263, but any other suitable codec, such as MPEG-4, may equally be used. Each layer in the hierarchy is coded in such a way as to allow the quality of individual pictures to be enhanced and their resolution to be increased, and additional pictures to be included to increase the overall picture rate, as explained with reference to FIG. 2. FIG. 2 illustrates a typical dependency between pictures in an H.263 scalable layered coder, with the boxes representing the frames for each layer and arrows showing the dependency between frames. The lowest row shows original, un-coded frames. The next row shows the lowest layer (Layer 0) of the hierarchy, which is coded at half the frame rate of Layer 1. Frames in Layer 0 are predicted from the previously encoded frame, as in conventional video compression. Frames in Layer 1 may be predicted from the previously encoded frame in Layer 1 and, if present, the temporally simultaneous Layer 0 encoded frame. Frames in Layer 2 may be predicted from the previously encoded frame in Layer 2 and, if present, the temporally simultaneous Layer 1 or Layer 0 encoded frame. The H.263 specification allows for 15 layers; in the present embodiment the server and client software is not limited in the number of layers that can be used, but in this case a database has been generated with video streams containing four layers.

[0024] FIG. 3 shows the architecture of media server 202. In the present embodiment the audiovisual source is a database 301 of compressed video data and the media server 202 is responsible for reading the compressed data from the database, packetising and distributing it. The data is distributed according to the Real-time Transport Protocol (RTP) which provides end-to-end network transport functions suitable for applications transmitting real-time data such as audio, video or simulation data. RTP does not address resource reservation and does not guarantee quality-of-service for real-time services. The data transport is augmented by a control protocol (RTCP) to allow monitoring of the data delivery and to provide minimal control and identification functionality. RTP and RTCP are designed to be independent of the underlying transport and network layers. The media server 202 has four major components: database reader 302, video RTP packetiser 303, audio RTP packetiser 305 and a network interface 304. Video and audio data are stored as files of compressed bits in the database 301. The database reader 302 retrieves and synchronises the compressed audio and video from the database 301.

[0025] The audio and video data is then sent to an RTP packetiser 303, 305. Audio and video information is transmitted over the IP network using the User Datagram Protocol (UDP) and Real-time Transport Protocol (RTP) packetisation. UDP provides a checksum to detect transmission errors, but does not guarantee data delivery: packets may be lost, duplicated or re-ordered. RTP provides end-to-end delivery services, such as payload type identification, sequence numbering, time-stamping and delivery monitoring. RTP packetisers 303, 305 attach the RTP Header and, in the case of video packets, the H.263 Payload Header which provides some protection from packet loss. The Payload Header contains information specific to the video stream, such as motion vector predictors, which is obtained by decoding the compressed bit stream.

[0026] The rate at which data is read from the database 301 is controlled from the network interface 304, which is illustrated in greater detail in FIG. 4. With reference to FIG. 4 the arrangement of the network interface 304 for the video data will now be described. A packet numbering module 401 assigns a number to each packet of data. These packet numbers are used by the client to assemble packets into the sequence required for decoding. The number of bits produced by compressing a picture is in general not constant, even when the transmission rate is constant. The first picture of a new scene will usually produce a larger number of bits than a picture where there is little movement or detail. The encoding process will use a control strategy together with an output data buffer to smooth these variations in preparation for transmission at a constant bit rate. In the case of layered coding, each layer has its own control strategy and buffer as it has its own constant bit rate constraint to meet. The delay across each of these buffers will not be constant and will not be the same for each layer. Delay variation between the layers is therefore introduced at the source of the data. Transmission across a network may cause this delay variation to increase. A packet numbering scheme is therefore needed to ensure that a client can arrange the packets it receives into the order required for decoding. This scheme is required to handle the cases of the client not receiving all of the layers and of packet loss in the layers that it does receive. Each layer is transmitted as an independent RTP Session on a separate IP address by a session handler 403, 405, 407, 409. The rate at which data is transmitted is controlled by the Transfer Rate Control module 402, which counts Layer 0 bytes to ensure that the correct number are transmitted in a given period of time. The transmission rate of the other layers is smoothed and locked to the rate of Layer 0 using First-In First-Out (FIFO) buffer elements 404, 406, 408.
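
The packet numbering described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a single counter numbers packets from all layers in decoding order before they are handed to per-layer sessions; the names Packet, PacketNumberer and split_by_layer are hypothetical and do not appear in the described system.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Packet:
    number: int     # position in the order required for decoding (assumed global)
    layer: int      # layer the packet belongs to
    payload: bytes


class PacketNumberer:
    """Hypothetical stand-in for a packet numbering module."""

    def __init__(self):
        self._next = count()

    def tag(self, layer, payload):
        # Packets are numbered in the order they are presented, i.e. decode order.
        return Packet(next(self._next), layer, payload)


def split_by_layer(packets):
    """Group numbered packets per layer, one group per transmission session."""
    sessions = {}
    for packet in packets:
        sessions.setdefault(packet.layer, []).append(packet)
    return sessions


numberer = PacketNumberer()
stream = [numberer.tag(layer, b"data") for layer in (0, 1, 0, 2, 1)]
sessions = split_by_layer(stream)
```

Because each packet carries its number with it, a receiver that gets only some of the layers, or loses packets within a layer, can still place what it has received in decoding order.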

[0027] The client 204 of the known hierarchical streaming technology will now be described with reference to FIG. 5. Each RTP/RTCP Session associated with each layer of encoded data has a session handler 501, 502, 503, 504 at the client which is responsible for receiving RTP packets from the network. The packets are then passed to a source demultiplex module 505. This receives packets from all layers and all sources, demultiplexes them by source, and routes them to a blender module 506, 507 for that source. The blender module 506, 507 receives packets from all layers from one source in the order that they were received from the network. This may not be the order required for decoding because of packet inter-arrival jitter or packet loss. The blender module 506, 507 uses the packet numbers in the RTP headers to arrange the packets from each layer in order and then combines the packets from all layers together. The output of the blender module 506, 507 is a single stream of packets which are in the correct order for decoding. This is the same order as they come out of the packet numbering module 401 of the media server 202.

[0028] Packets are then sent to the decoder 508, 509 where the packets are decoded into 20 ms blocks of audio samples or into video pictures. In the case of pictures, these are rendered to a window on the display 508.

[0029] Also provided is a congestion manager 511 to which the session handlers 501-504 report packet loss. If packets are being consistently lost, indicating network congestion, the congestion manager 511 will instruct the session handler responsible for the highest layer of compressed data to terminate the RTP/RTCP Session. Periodically the congestion manager 511 will instruct an experimental joining of a layer via the appropriate session handler to test whether there is available bandwidth in the network. If this experiment is successful, i.e. substantially all of the packets of each layer are reaching the client, then the new layer will be adopted. In this way the maximum available bandwidth is employed.
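
The reordering performed by a blender module might be sketched as follows. This is a minimal illustration in Python, not the described implementation: packets are modelled as (packet number, layer) tuples, the packet numbers are assumed to be unique across layers, and the function name blend is hypothetical.

```python
import heapq


def blend(received_by_layer):
    """Merge per-layer packets into one stream ordered by packet number.

    received_by_layer maps a layer index to the packets of that layer in
    arrival order. Duplicates are discarded; lost packets are simply absent.
    """
    ordered_layers = [sorted(set(packets)) for packets in received_by_layer.values()]
    # Each per-layer list is now sorted, so a k-way merge gives decode order.
    return list(heapq.merge(*ordered_layers))


arrivals = {
    0: [(0, 0), (4, 0), (2, 0)],   # Layer 0 arrived out of order
    1: [(3, 1), (1, 1), (3, 1)],   # Layer 1 contains a duplicate
}
decode_order = blend(arrivals)
```

Sorting each layer first and then merging mirrors the two steps in the text: packets of each layer are arranged in order, and the layers are then combined.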
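
The drop and experimental-join behaviour can be sketched as a small state machine. The following Python sketch rests on assumptions that are not in the text: each report summarises packet loss over one interval, a loss fraction above loss_threshold counts as congestion, and a join experiment is attempted after probe_interval consecutive clean reports. All names and default values are hypothetical.

```python
class CongestionManager:
    """Illustrative receiver-side layer control, not the described module."""

    def __init__(self, max_layers, loss_threshold=0.05, probe_interval=5):
        self.max_layers = max_layers
        self.loss_threshold = loss_threshold    # assumed congestion threshold
        self.probe_interval = probe_interval    # assumed reports between experiments
        self.layers = 1                         # start with the base layer only
        self._clean_reports = 0

    def report(self, loss_fraction):
        """Process one loss report and return the number of layers to receive."""
        if loss_fraction > self.loss_threshold:
            # Congestion: leave the highest layer (this also ends a failed experiment).
            if self.layers > 1:
                self.layers -= 1
            self._clean_reports = 0
        else:
            self._clean_reports += 1
            if self._clean_reports >= self.probe_interval and self.layers < self.max_layers:
                # Experimental join; the layer is kept unless loss follows.
                self.layers += 1
                self._clean_reports = 0
        return self.layers


manager = CongestionManager(max_layers=4, probe_interval=2)
history = [manager.report(loss) for loss in (0.0, 0.0, 0.2, 0.2)]
```

In this sketch a join that is followed by loss is undone by the ordinary drop rule, so no separate bookkeeping is needed for the experiment itself.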

[0030] It is possible for the congestion manager 511 to instruct layers to be dropped and restored rapidly as the network bandwidth fluctuates. This could be annoying for a viewer and so a way of ‘hiding’ the shift is employed. The layered encoding method described above is adapted so that if a layer is to be dropped or added, the changeover preferentially occurs during a scene change in the video data. A video scene is typically perceived as a number of consecutive frames within a video sequence that do not show significant changes in the video content. Within a video scene or shot, the camera action may be fixed or may exhibit a number of relatively uniform changes like panning, zooming, tracking etc. Scene changes may be recognised as abrupt transitions of the camera action or gradual transitions. In order to identify the scenes within a video sequence the assumption is that the levels of motion energy as well as those of luminance and colour do not change much between successive frames within a single scene. Techniques of scene boundary identification include pixel differencing, motion vector and block matching techniques. A very sudden change in the content of temporally adjacent frames will thus appear as a change of scene; clearly it is conceivable that such a change might not actually relate to a scene change at all, but, for example, may instead be the consequence of a large foreground object coming rapidly into view. However, such an abrupt change will nonetheless be a good place to add or remove a session layer and thus the term scene change is meant to cover such large changes in frame content from one frame to another.
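
Of the boundary identification techniques mentioned, pixel differencing is the simplest to illustrate. The Python sketch below assumes that each frame is available as a flat list of luminance values of equal length and that a fixed threshold on the mean absolute difference marks a scene change; the threshold value and the function names are illustrative assumptions.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute luminance difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def scene_changes(frames, threshold=30.0):
    """Return indices of frames that differ sharply from their predecessor."""
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold
    ]


frames = [[10] * 4, [12] * 4, [200] * 4, [198] * 4]
cuts = scene_changes(frames)   # only frame 2 differs sharply from its predecessor
```

A detector of this kind also fires on a large foreground object coming rapidly into view, which is acceptable here because any such abrupt change is a suitable switching point.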

[0031] FIG. 6 illustrates a system which detects congestion in the network and also imposes the extra condition of adding or dropping a layer upon a scene change. The server 601 comprises the components of the media server 202 of FIG. 1 and the unit 603 comprises the components of the client 204 also shown in FIG. 1. Video data from the audiovisual source is also passed to a content analyser 604 which segments the video sequence into scenes. The segmented video is then passed to an objective quality assessment module (OQAM) 605 which assesses the contribution to the perceived quality of a scene that each component layer provides.

[0032] These results are then passed to an inter layer adaptation module (ILAM) 607. The function of the ILAM 607 is to continuously calculate the number of layers that maximise the perceived quality for the session. The ILAM 607 also receives input from a sender congestion manager 608 which reports on the bandwidth available to the session on the network 203. The sender congestion manager 608 receives feedback from a client congestion manager 609 on the number of packets that have been received. If this matches the number of packets that were sent then the bandwidth is known to be at least the current transmission rate. If packets are being lost then the bandwidth available to the session is less than the transmission rate, in which case the sender congestion manager 608 informs the ILAM 607 that a layer should be dropped.

[0033] In order to select which layer should be dropped the ILAM 607 couples the bandwidth required by a layer with its contribution to the quality of the complete image, as calculated by the OQAM 605. The ILAM 607 performs an exhaustive search on all of the bandwidth/quality values. When the ILAM 607 has selected which layer is to be dropped from a particular scene, the timing of the drop is preferentially set to coincide with the start of the transmission of that particular scene. In this way the drop in quality occurs with the scene change and is thus much less noticeable to a viewer than if the quality change had occurred during a scene. In the case that no packets have been lost for a preset period of time the sender congestion manager 608 will request the ILAM 607 to add in a layer to test whether all of the available bandwidth is being employed. If no packets are lost during this experiment then the newly added layer is maintained. This process of experimentation with adding layers is continued until a significant proportion of packets are lost, in which instance the system can be confident that all of the available bandwidth is being employed. Again, the timing of the introduction of a layer is set to occur preferentially as the scene changes.

[0034] Under a given bit-rate allocation the levels of perceived quality do not change considerably within a scene, but scene cuts cause considerable changes in perceived quality, especially when the content features (spatial and motion energy) change a lot between subsequent scenes. As a consequence there will also be a significant difference in the corresponding quality scores for those successive scenes, which may justify a rescheduling of the number of layers in the stream.
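
One way to picture the search over bandwidth/quality values is the following Python sketch. It assumes that layers can only be used cumulatively (Layer 0, then Layers 0 and 1, and so on), that each layer is described by a (bit-rate in kbit/s, quality contribution) pair for the current scene, and that quality contributions simply add; these modelling choices and the name best_layer_count are assumptions made for illustration.

```python
def best_layer_count(layers, available_kbps):
    """Try every cumulative layer count and keep the best one that fits.

    layers is a list of (rate_kbps, quality_contribution) pairs, lowest
    layer first. Returns 0 if even the base layer exceeds the bandwidth.
    """
    best_n, best_quality = 0, float("-inf")
    total_rate = total_quality = 0.0
    for n, (rate, quality) in enumerate(layers, start=1):
        total_rate += rate
        total_quality += quality
        if total_rate <= available_kbps and total_quality > best_quality:
            best_n, best_quality = n, total_quality
    return best_n


scene_layers = [(100, 3.0), (100, 1.0), (200, 0.5), (400, 0.2)]
chosen = best_layer_count(scene_layers, available_kbps=450)
```

Re-running the search with the values for the next scene gives the layer count to apply when that scene begins, which is how the change can be lined up with the scene boundary.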

[0035] The invention is not limited in use to hierarchical encoding schemes. The invention is suitable for use in any encoding technique where the adaptation of transmission bit-rate to accommodate fluctuations in network capacity occurs. For instance the invention may be applied to a transcoding system where encoded data streams are transcoded from a high bit-rate to a low bit-rate or from a low bit-rate to a high bit-rate. The present invention would be suitable to reduce the impact for the viewer as the output bit-rate shifts in response to network conditions by timing the transition to occur upon a scene change in the encoded video sequence. Another example of an adaptive video streaming technique to which the invention may be applied is where multiple independent video streams of different bit-rates are transmitted. In this case the client chooses which stream to accept based on session bandwidth. The client may switch from one stream to another as bandwidth fluctuates; the present invention ensures that the switch is timed to coincide with a scene change in the encoded video stream.

[0036] Naturally, there may be times when a change of scene is such a long way away that it is advantageous to switch from one bit-rate to another other than during a change of scene. For example, consider the case where multiple independent video streams at different bit-rates are available for transmission by a media server to a client as, for example, described in co-pending European patent application No. 00310594.7, the contents of which are hereby incorporated herein by way of reference. In such a case, the server may be capable of transmitting a first stream at a bit-rate of 500 kbit/s and a second, higher quality stream at a bit-rate of 1500 kbit/s. The client may initially request that the server transmit the first stream at a transmission rate of 1000 kbit/s. If the network is not congested and all of the packets transmitted are successfully received by the client, the receive buffer at the client will start to fill with data at a rate of 500 kbit/s, since the client will only be removing data from the buffer at a rate of 500 kbit/s. After, say, 10 seconds of filling up the buffer at this rate, the client will have a buffer of ten seconds' worth of data, at which point it may decide it can attempt to receive the higher bit-rate second stream of video data from the server and thus sends an appropriate request to the server to this effect. If, however, the server is aware that the client has a receive buffer of 5 Mbytes size, it knows that it may continue sending data from the first stream at the rate of 1000 kbit/s for at least another 150 seconds before the receive buffer overflows, causing problems for the client. Therefore, the server may attempt to wait for a specified period to see if a scene change occurs during this interval. If so, a switch to the second higher bit-rate signal is not made until the change of scene occurs. Of course, if there is no change of scene within the determined period, the server switches to the higher rate anyway. In this example, a period of only ten seconds is deemed appropriate as the waiting time.
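
The wait-then-switch rule can be sketched in Python as follows. The sketch assumes that the server inspects one scene-change flag per frame after the switch request and measures the waiting period in frames; the frame rate of 25 frames per second used in the example is an assumption, and the function name is hypothetical.

```python
def frames_until_switch(scene_change_flags, max_wait_frames):
    """Return how many frames after the request the switch should happen.

    scene_change_flags yields one boolean per frame following the request.
    The switch happens at the first scene change, or when the waiting
    period expires, whichever comes first.
    """
    for waited, is_scene_change in enumerate(scene_change_flags):
        if waited >= max_wait_frames:
            break
        if is_scene_change:
            return waited
    return max_wait_frames


# Ten-second waiting period at an assumed 25 frames per second.
MAX_WAIT = 10 * 25
early = frames_until_switch([False, False, True, False], MAX_WAIT)
timed_out = frames_until_switch([False] * 300, MAX_WAIT)
```

Here `early` is 2, because a scene change is seen on the third frame, and `timed_out` is 250, because the waiting period expires without any scene change.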

[0037] Note that instead of simply waiting to see if a change of scene occurs in the specified period and switching at the end of the period if no such change of scene is detected, an alternative method would be to pre-analyse the video to be sent (clearly this only applies to pre-stored video data and not live video) and to note when changes of scene occur. In such a case, the server, upon receipt of a request from the client to switch streams, could search to see if a suitable change of scene will occur within the predetermined period and if not to switch immediately to the new bit stream.

[0038] In the present example, upon switching to the higher rate bit stream, the client may have requested a transmission rate of 1500 kbit/s corresponding to the rate at which the data will be drawn from the receive buffer by the client. In such a case, the buffer size of 10 seconds should remain constant so long as all of the transmitted packets are successfully received by the client. However, in the event of congestion on the network, a proportion of packets may fail to arrive at the client. In such a case, the server will be warned of this via the RTCP. If the congestion is sufficiently severe, the server may deduce that the buffer is in danger of emptying, which would cause a break in the video displayed by the client to occur. To prevent this, the server may switch back to the lower bit rate stream. Via the notification of how many packets are being lost, the server can deduce how long it will be before the buffer is emptied. For this time, the server can wait to see if a change of scene occurs, and if so, the new stream will be switched to at that point. Note that it would also be possible for the client to simply request that the new stream be switched to upon detecting that its receive buffer is emptying at an unsustainable rate.

[0039] Note that the amount of data in the buffer is actually of less significance than the rate of change of the amount of data in the buffer. Thus it is preferably this quantity which either the server or the client measures in order to determine whether to change from one bit stream to another (or whether to add or drop a layer in the first example).
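
For pre-stored video, the lookup might be sketched as below. This Python sketch assumes that the pre-analysis has produced a sorted list of scene-change times in seconds from the start of the sequence; the function name and the example times are hypothetical.

```python
from bisect import bisect_left


def switch_time(scene_change_times, now, max_wait):
    """Return the time at which to switch streams.

    scene_change_times is a sorted list of times (seconds) at which scene
    changes occur. If the next scene change falls within max_wait seconds
    of now, switch then; otherwise switch immediately.
    """
    i = bisect_left(scene_change_times, now)
    if i < len(scene_change_times) and scene_change_times[i] - now <= max_wait:
        return scene_change_times[i]
    return now


changes = [4.0, 31.5, 58.0]
deferred = switch_time(changes, now=25.0, max_wait=10.0)
immediate = switch_time(changes, now=32.0, max_wait=10.0)
```

With these example times `deferred` is 31.5, the next scene change, and `immediate` is 32.0, since the following scene change is 26 seconds away.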
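
The deduction of how long the buffer will last can be illustrated with a short calculation. The Python sketch below assumes a constant loss fraction and constant send and playout rates, which is a simplification; the function name and the 25% loss figure are illustrative only.

```python
def seconds_until_empty(buffered_seconds, playout_kbps, send_kbps, loss_fraction):
    """Estimate the time before the receive buffer runs dry.

    buffered_seconds is the amount of video currently buffered, expressed
    as playing time at playout_kbps. Returns infinity if the buffer is not
    draining.
    """
    arrival_kbps = send_kbps * (1.0 - loss_fraction)
    drain_kbps = playout_kbps - arrival_kbps
    if drain_kbps <= 0:
        return float("inf")
    buffered_kbit = buffered_seconds * playout_kbps
    return buffered_kbit / drain_kbps


# 10 s buffered, 1500 kbit/s sent and played, an assumed 25% of packets lost.
remaining = seconds_until_empty(10, 1500, 1500, 0.25)   # about 40 seconds
```

The result bounds the period for which the server can wait for a change of scene before switching back to the lower bit rate stream.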

1. A method of operating a multimedia server (202), said method comprising: providing a stream of video data representing a video sequence to an output of the multimedia server (202), wherein the output of the media server (202) is connected to a packet based network (203), measuring a property of the video data in order to determine the occurrence of a scene change in the video sequence, detecting the available bandwidth on the network (203), varying the bit-rate of the stream of video data, wherein the method is characterised in that variation in the bit-rate of the video data is controlled to occur in response to variations in the capacity of the network (203) and preferentially with a scene change in the video sequence.
2. A method of operating a multimedia server (202) in accordance with claim 1 wherein upon detecting a decrease in the available bandwidth the bit-rate of the video data is decreased.
3. A method of operating a multimedia server (202) in accordance with claim 1 wherein the bit-rate of the video data is periodically increased in order to ensure that all of the available bandwidth is being used.
4. A method of operating a multimedia server (202) in accordance with claims 1 to 3 wherein said stream of video data is comprised of a plurality of hierarchically encoded layers.
5. A method of operating a multimedia server (202) in accordance with claim 4 wherein the bit-rate of the video data is varied by adding or dropping layers from the stream of data transferred to the output port.
6. A method of operating a multimedia server (202) in accordance with claims 4 or 5 wherein the objective quality contribution of each layer to the overall quality of the video sequence is measured in order to determine which of the layers is to be added or dropped from the stream of data transferred to the output port.
7. A method of operating a multimedia server (202) in accordance with claims 1 to 3 wherein said stream of video data is comprised of a plurality of independently encoded flows each of which is encoded at a different bit-rate.
8. A method of operating a multimedia server (202) in accordance with claim 7 wherein the bit-rate of the video data is varied by switching between independently encoded flows.
9. A method of operating a multimedia server (202) in accordance with any of the preceding claims in which the source (201) of video data is stored in encoded form in a database.
10. A method of operating a multimedia server (202) in accordance with any of claims 1 to 8 in which the source (201) of video data is encoded in real-time.
11. A method of operating a multimedia server (202) in accordance with any of the preceding claims wherein, upon detecting a variation in the capacity of the network such as to render it appropriate for a change in bit rate to occur, the server waits for a predetermined time to attempt to detect a change of scene, and if such a change of scene is detected within the predetermined time, the change of bit rate is caused to occur at the time of the change of scene, but if a change of scene is not detected within the predetermined time, then the change of bit rate is caused to occur at the expiry of the predetermined time.
12. A method of operating a multimedia server (202) in accordance with any of the preceding claims wherein, upon detecting a variation in the capacity of the network such as to render it appropriate for a change in bit rate to occur, the server determines if there is going to be a change of scene within a predetermined time, and if it is determined that there will be such a change of scene within the predetermined time, the change of bit rate is caused to occur at the time of the change of scene, but if it is determined that there is not going to be a change of scene within the predetermined time, then the change of bit rate is caused to occur substantially immediately after having made such a determination.
13. The method of claim 12 further including the step of, prior to providing a stream of video data to the output of the media server, processing the video data to determine the positions within the video data at which changes of scene occur and making this information available for permitting a determination of whether a scene change will occur within a predetermined period at any point along the stream of video data.
14. A multimedia server (202) comprising: a reader (302) for reading video data representing a video sequence from a source (301), a scene change detector (604) for detecting changes of scene in the video sequence, a bit-rate controller (607) for controlling the rate at which the video data is transferred from the reader to an output port, wherein the output port is capable of communicating with a client (204) on a packet network (203); the media server further comprising means (608) for detecting available bandwidth between the multimedia server (202) and client (204) on the packet network, characterised in that the bit-rate controller (607) is arranged to vary the bit-rate of the video data transferred to the output port in dependence on the detected bandwidth preferentially in correlation with a scene change detected by the scene change detector (604).